Airbnb & Zillow Data in NYC Profit Analysis

1. Introduction

The code below walks through the entire profitability analysis. Before running this report, please read the README.md first.

1.1 Assumption Summary

Here are the assumptions this analysis report relies on.

  1. The investor will pay for the property in cash.
  2. The time value of money discount rate is 0%.
  3. All properties and all square feet within each locale can be assumed to be homogeneous:

    • the average price per square foot, calculated from price, weekly_price, and monthly_price, is constant within each neighborhood;
    • the average cost per square foot, calculated from the current median home value, is constant within each neighborhood.
  4. Two-bedroom properties are the most profitable.
  5. Only the following features can influence profitability:

    • neighbourhood_group_cleansed
    • zipcode
    • latitude
    • longitude
    • bedrooms
    • price
    • weekly_price
    • monthly_price
    • availability_30
    • availability_60
    • availability_90
    • availability_365
    • square_feet
  6. For each neighbourhood_group_cleansed partition, price variables (price, weekly_price, monthly_price) within the 10%–90% quantile range are qualified for estimation.
  7. The Airbnb customer ratio among single-day, weekly, and monthly stays is {'day' : 3,'week' : 1,'month' : 1}.
  8. If a price variable (price, weekly_price, monthly_price) is missing or zero, the house doesn't support that rental period (e.g., weekly or monthly rent).
  9. Availability variables (availability_365, availability_90, availability_60, availability_30) that are non-zero, non-missing, and within the 10%–90% quantile range are qualified.
  10. Availability days are uniformly distributed within the 365 days, so we can use $\frac{availability\_90}{90} \times 365$, $\frac{availability\_60}{60} \times 365$, or $\frac{availability\_30}{30} \times 365$ to estimate availability_365 when availability_365 is not qualified.
  11. availability_365 calculation priority: $availability\_365 > availability\_90 > availability\_60 > availability\_30$.
  12. The occupancy rate is 75%.
  13. The average of the month-over-month percentage change rates over the past 5 years (2012-06 to 2017-06) is taken as the average growth rate from 2017-06 to now.
  14. The estimated yearly revenue is the annual profit.
  15. The annual profit is constant over the break-even years.

Most of these assumptions can be tested directly from the pipeline. Users can test them with the product_pipeline module:

import product_pipeline

airbnb_revenue = feature_revenue_pipeline.set_params(...).fit_transform(...)
zillow_cost = current_value_pipeline.set_params(...).fit_transform(...)
final = profit_pipeline.set_params(...).fit_transform(airbnb_revenue, zillow_cost)

1.2 Evaluation Metric Selection

  • Annual return on investment
  • Break even years
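Under the report's assumptions (cash purchase, 0% discount rate, constant annual profit), both metrics reduce to simple ratios. A minimal sketch, with hypothetical function names and example figures of our own:

```python
# Sketch of the two evaluation metrics under the stated assumptions:
# cash purchase, 0% discount rate, constant annual profit.
def annual_roi(annual_profit: float, property_cost: float) -> float:
    """Annual return on investment as a fraction of the purchase price."""
    return annual_profit / property_cost


def break_even_years(annual_profit: float, property_cost: float) -> float:
    """Years until cumulative profit equals the cash purchase price."""
    return property_cost / annual_profit


print(annual_roi(50_000, 1_000_000))       # 0.05, i.e. 5% per year
print(break_even_years(50_000, 1_000_000))  # 20.0 years
```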

2. Packages Loading and Settings

In [1]:
from data_init.airbnb_data import AirbnbData
from data_init.zillow_data import ZillowData
import pandas as pd
import numpy as np
import scipy
from sklearn.preprocessing import FunctionTransformer, StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from util import *
import plotly.graph_objects as go
import plotly.figure_factory as ff
import plotly.express as px
import re
import datetime
import matplotlib.pyplot as plt
In [2]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

3. Init Zillow and Airbnb Datasets

In [3]:
zillow = ZillowData()
zillow.loadZillowData()

airbnb = AirbnbData()
airbnb.fetchAirbnbData()
airbnb.loadAirbnbData()


zillowInfo = zillow.data.describe().reset_index()
airbnbInfo = airbnb.data.describe().reset_index()

Given the prior conclusion that two-bedroom properties are the most profitable, let's first limit the Airbnb data to two-bedroom listings in New York City.

In [4]:
airbnb_filter = airbnb.data['bedrooms'] == 2.0
airbnb_data = airbnb.data[airbnb_filter].copy()

Limit the Zillow data to NYC.

In [5]:
zillow_filter = zillow.data['City'] == 'New York'
zillow_data = zillow.data[zillow_filter].copy()

4. Airbnb Data Insight

In [6]:
### Data dimension
airbnb_data.shape
Out[6]:
(6497, 106)

Summary: There are 6,497 rows and 106 columns in the two-bedroom Airbnb data.

In [7]:
airbnb_data.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6497 entries, 19 to 48873
Data columns (total 106 columns):
id                                              int64
listing_url                                     object
scrape_id                                       int64
last_scraped                                    object
name                                            object
summary                                         object
space                                           object
description                                     object
experiences_offered                             object
neighborhood_overview                           object
notes                                           object
transit                                         object
access                                          object
interaction                                     object
house_rules                                     object
thumbnail_url                                   float64
medium_url                                      float64
picture_url                                     object
xl_picture_url                                  float64
host_id                                         int64
host_url                                        object
host_name                                       object
host_since                                      object
host_location                                   object
host_about                                      object
host_response_time                              object
host_response_rate                              object
host_acceptance_rate                            float64
host_is_superhost                               object
host_thumbnail_url                              object
host_picture_url                                object
host_neighbourhood                              object
host_listings_count                             float64
host_total_listings_count                       float64
host_verifications                              object
host_has_profile_pic                            object
host_identity_verified                          object
street                                          object
neighbourhood                                   object
neighbourhood_cleansed                          object
neighbourhood_group_cleansed                    object
city                                            object
state                                           object
zipcode                                         object
market                                          object
smart_location                                  object
country_code                                    object
country                                         object
latitude                                        float64
longitude                                       float64
is_location_exact                               object
property_type                                   object
room_type                                       object
accommodates                                    int64
bathrooms                                       float64
bedrooms                                        float64
beds                                            float64
bed_type                                        object
amenities                                       object
square_feet                                     float64
price                                           object
weekly_price                                    object
monthly_price                                   object
security_deposit                                object
cleaning_fee                                    object
guests_included                                 int64
extra_people                                    object
minimum_nights                                  int64
maximum_nights                                  int64
minimum_minimum_nights                          int64
maximum_minimum_nights                          int64
minimum_maximum_nights                          int64
maximum_maximum_nights                          int64
minimum_nights_avg_ntm                          float64
maximum_nights_avg_ntm                          float64
calendar_updated                                object
has_availability                                object
availability_30                                 int64
availability_60                                 int64
availability_90                                 int64
availability_365                                int64
calendar_last_scraped                           object
number_of_reviews                               int64
number_of_reviews_ltm                           int64
first_review                                    object
last_review                                     object
review_scores_rating                            float64
review_scores_accuracy                          float64
review_scores_cleanliness                       float64
review_scores_checkin                           float64
review_scores_communication                     float64
review_scores_location                          float64
review_scores_value                             float64
requires_license                                object
license                                         object
jurisdiction_names                              object
instant_bookable                                object
is_business_travel_ready                        object
cancellation_policy                             object
require_guest_profile_picture                   object
require_guest_phone_verification                object
calculated_host_listings_count                  int64
calculated_host_listings_count_entire_homes     int64
calculated_host_listings_count_private_rooms    int64
calculated_host_listings_count_shared_rooms     int64
reviews_per_month                               float64
dtypes: float64(22), int64(21), object(63)
memory usage: 5.3+ MB

Assumption: Based on real-estate industry knowledge, only the following fields are taken as the initial feature selection. We will come back to check this later.

  • neighbourhood_group_cleansed
  • zipcode
  • latitude
  • longitude
  • bedrooms
  • price
  • weekly_price
  • monthly_price
  • availability_30
  • availability_60
  • availability_90
  • availability_365
  • square_feet
In [8]:
selected_feature = [
    "neighbourhood_group_cleansed", "zipcode", "latitude", "longitude",
    "price", "weekly_price", "monthly_price", "availability_30",
    "availability_60", "availability_90", "availability_365", "square_feet"
]

selected_feature_ix = [
    airbnb_data.columns.get_loc(c) for c in selected_feature
]

Using sklearn's FunctionTransformer to extract features lets us build a reusable pipeline.

In [9]:
def extract_feature(X):
    return X.iloc[:, selected_feature_ix]
In [10]:
feature_selector = FunctionTransformer(extract_feature, validate=False)
In [11]:
airbnb_data_selected_df = feature_selector.fit_transform(airbnb_data)

4.1 Airbnb Data Cleaning

Let's take a look at the basic information about the dataset

In [12]:
basicInfoAnalysis(airbnb_data_selected_df)
# of numeric columns: 7
# of object columns: 5
# of category columns: 0
# of bool columns: 0
********** Numeric Variable Insight **********
          latitude    longitude  availability_30  availability_60  \
count  6497.000000  6497.000000      6497.000000      6497.000000   
mean     40.724140   -73.956733         6.422503        16.311067   
std       0.052073     0.044105         9.053389        19.096649   
min      40.508730   -74.239140         0.000000         0.000000   
25%      40.686400   -73.985350         0.000000         0.000000   
50%      40.719610   -73.959580         2.000000         8.000000   
75%      40.759670   -73.941350        10.000000        28.000000   
max      40.905270   -73.716900        30.000000        60.000000   

       availability_90  availability_365  square_feet  
count      6497.000000       6497.000000   101.000000  
mean         27.042789        123.605048   844.762376  
std          29.811322        129.414869   468.567821  
min           0.000000          0.000000     0.000000  
25%           0.000000          0.000000   700.000000  
50%          16.000000         76.000000   850.000000  
75%          50.000000        247.000000  1000.000000  
max          90.000000        365.000000  2200.000000  

Summary: Since no data types were specified when importing the data, there are several data types we may want to fix.

  • Price-related variables need the "$" (and ",") removed and a cast to numeric.
  • Zipcodes containing "-" or having only 4 digits need to be fixed.
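These two fixes can be sketched on toy data (illustrative values, not the pipeline itself). Note that `regex=False` is needed so pandas treats "$" as a literal character rather than a regex anchor:

```python
import pandas as pd

# Strip "$" and "," from price strings, and keep only the first five
# digits of zipcodes such as "10003-1234".
prices = pd.Series(["$1,250.00", "$99.00"])
zipcodes = pd.Series(["10003-1234", "10003"])

clean_prices = (prices.str.replace("$", "", regex=False)
                      .str.replace(",", "", regex=False)
                      .astype(float))
clean_zipcodes = zipcodes.str[:5]

print(clean_prices.tolist())    # [1250.0, 99.0]
print(clean_zipcodes.tolist())  # ['10003', '10003']
```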

Also, there are missing values among these variables. Let's also take a look at the missing values

In [13]:
airbnb_missing = missingAnalysis(airbnb_data_selected_df)

Summary: less than 1% of the zipcode values are missing.

4.2 Airbnb Insight Pipeline Build

In [14]:
neighbourhood_group_cleansed_ix, zipcode_ix, latitude_ix, \
longitude_ix, price_ix, weekly_price_ix, monthly_price_ix, \
availability_30_ix, availability_60_ix, availability_90_ix, \
availability_365_ix, square_feet_ix = [
    list(airbnb_data.columns).index(col) for col in selected_feature
]


class FeatureCleaning(BaseEstimator, TransformerMixin):
    def __init__(self, price_ix, weekly_price_ix, monthly_price_ix,
                 zipcode_ix):
        self.price_ix = price_ix
        self.weekly_price_ix = weekly_price_ix
        self.monthly_price_ix = monthly_price_ix
        self.zipcode_ix = zipcode_ix

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        for ix in [self.price_ix, self.weekly_price_ix, self.monthly_price_ix]:
            X.iloc[:, ix] = X.iloc[:, ix].str.replace(
                "$", "", regex=False).str.replace(",", "", regex=False).astype(float)

        X.iloc[:, self.zipcode_ix] = X.iloc[:, self.zipcode_ix].str[:5]
        return X

Now we can come up with a pipeline for the basic preprocessing.

In [15]:
feature_pipeline = Pipeline([('feature_clean',
                              FeatureCleaning(price_ix, weekly_price_ix,
                                              monthly_price_ix, zipcode_ix))])
In [16]:
airbnb_prepared_df = feature_pipeline.fit_transform(airbnb.data.copy())
In [17]:
categorical_count_plot(airbnb_prepared_df, "zipcode")

Let's plot the records with missing zipcode on the map to see whether there is any missing-data pattern.

In [18]:
tdf = airbnb_prepared_df[airbnb_prepared_df["zipcode"].isnull()]
nyc_map_plot(tdf, size="price")

Summary: The missing values seem randomly distributed on the map, so it should be safe to drop those records.

Let's re-define our feature cleaning class for the pipeline

In [19]:
class FeatureCleaning(BaseEstimator, TransformerMixin):
    def __init__(self, price_ix, weekly_price_ix, monthly_price_ix,
                 zipcode_ix):
        self.price_ix = price_ix
        self.weekly_price_ix = weekly_price_ix
        self.monthly_price_ix = monthly_price_ix
        self.zipcode_ix = zipcode_ix

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        for ix in [self.price_ix, self.weekly_price_ix, self.monthly_price_ix]:
            X.iloc[:, ix] = X.iloc[:, ix].str.replace(
                "$", "", regex=False).str.replace(",", "", regex=False).astype(float)

        X.iloc[:, self.zipcode_ix] = X.iloc[:, self.zipcode_ix].str[:5]
        # drop records with missing zipcode
        X = X.loc[~X.iloc[:, self.zipcode_ix].isnull(), :]
        return X
In [20]:
feature_pipeline = Pipeline([
    ('feature_clean',
     FeatureCleaning(price_ix, weekly_price_ix, monthly_price_ix, zipcode_ix)),
])
In [21]:
airbnb_prepared_df = feature_pipeline.fit_transform(airbnb.data.copy())

4.3 Airbnb Square Feet Imputation

4.3.1 Square Feet Variable Insight

Let's take a look at the square_feet variable distribution first.

In [22]:
airbnb_prepared_df.square_feet.hist()
plt.show()
In [23]:
fig = go.Figure()
fig.add_trace(go.Box(x=airbnb_prepared_df.square_feet))
fig.show()

Summary: There are also records with square_feet == 0. We can try to fill those as well.

Assumption: Under the strong assumption that "All properties and all square feet within each locale can be assumed to be homogeneous", we can impute the square_feet variable from neighbourhood_group_cleansed and the price variables (price, weekly_price, monthly_price).

We need to take a look at the price distribution to find the qualified records.

In [24]:
# Add histogram data
x1 = airbnb_prepared_df['price'].dropna()
x2 = airbnb_prepared_df['weekly_price'].dropna()
x3 = airbnb_prepared_df['monthly_price'].dropna()

# Group data together
hist_data = [x1, x2, x3]

group_labels = ['price', 'weekly price', 'monthly price']

# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, group_labels, bin_size=200)
fig.show()

Let's also limit the data to two bedrooms.

In [25]:
def limit_bedrooms(X, n=2):
    return X[X['bedrooms'] == n]
In [26]:
feature_pipeline = Pipeline([('feature_clean',
                              FeatureCleaning(price_ix, weekly_price_ix,
                                              monthly_price_ix, zipcode_ix)),
                             ('limit_bedrooms',
                              FunctionTransformer(limit_bedrooms,
                                                  validate=False,
                                                  kw_args={'n': 2}))])
In [27]:
airbnb_prepared_final = feature_pipeline.fit_transform(airbnb.data.copy())

Let's take a look at the distribution of the price-related data first.

In [28]:
# Add histogram data
hist_data = []
group_labels = []
for v in airbnb_prepared_final['neighbourhood_group_cleansed'].unique():
    x = airbnb_prepared_final.loc[
        airbnb_prepared_final['neighbourhood_group_cleansed'] ==
        v, 'price'].dropna()
    hist_data += [x]
    group_labels += [v]

# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, group_labels, bin_size=20)
fig.show()

Summary: We see there are many extremely large values and zeros.

Let's calculate a new daily price from "price", "weekly_price", and "monthly_price" with outliers removed.

4.4 Build Airbnb Qualified Price Variables

Assumption: Use the 10% lower bound and 90% upper bound to remove price-variable outliers within each "neighbourhood_group_cleansed" partition.

In [29]:
def get_val_prices_ix(
        X,
        price_ix,
        neighbourhood_group_cleansed_ix=neighbourhood_group_cleansed_ix,
        lb=.1,
        ub=.9):
    no_zero = X.iloc[:, price_ix] != 0
    val_ix = no_zero.copy()  # copy so the per-group updates don't mutate no_zero
    for v in X.iloc[:, neighbourhood_group_cleansed_ix].unique():
        ngc_ix = X.iloc[:, neighbourhood_group_cleansed_ix] == v
        val_ix[ngc_ix] = (no_zero[ngc_ix]) & (
            X.loc[no_zero & ngc_ix, X.columns[price_ix]].between(
                X.loc[no_zero & ngc_ix, X.columns[price_ix]].quantile(lb),
                X.loc[no_zero & ngc_ix, X.columns[price_ix]].quantile(ub)))

    return val_ix


X = airbnb_prepared_final
val_price_ix = get_val_prices_ix(X, price_ix)
val_weekly_price_ix = get_val_prices_ix(X, weekly_price_ix)
val_monthly_price_ix = get_val_prices_ix(X, monthly_price_ix)
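An equivalent, arguably more idiomatic way to build the per-neighbourhood 10%–90% mask is `groupby(...).transform`. A self-contained sketch on toy data (the function name and toy frame are ours, not the report's; here missing prices are treated as zero):

```python
import pandas as pd

def quantile_mask(df, value_col, group_col, lb=0.1, ub=0.9):
    """Boolean mask: non-zero, non-missing values that fall inside the
    [lb, ub] quantile band of their group, with the quantiles computed
    over the non-zero values only."""
    nz = df[value_col].fillna(0) != 0
    grouped = df.loc[nz].groupby(group_col)[value_col]
    lo = grouped.transform(lambda s: s.quantile(lb))
    hi = grouped.transform(lambda s: s.quantile(ub))
    mask = pd.Series(False, index=df.index)
    mask.loc[nz] = df.loc[nz, value_col].between(lo, hi)
    return mask

toy = pd.DataFrame({
    "neighbourhood_group_cleansed": ["A"] * 5,
    "price": [0, 10, 100, 110, 10_000],
})
# the zero and the two extremes are rejected
print(quantile_mask(toy, "price", "neighbourhood_group_cleansed").tolist())
# [False, False, True, True, False]
```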
In [30]:
# Add histogram data
hist_data = []
group_labels = []
for v in X.loc[val_price_ix, 'neighbourhood_group_cleansed'].unique():
    x = X.loc[(X['neighbourhood_group_cleansed'] == v) &
              (val_price_ix), 'price'].dropna()
    hist_data += [x]
    group_labels += [v]

# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, group_labels, bin_size=20)
fig.update_layout(title='qualified daily price distribution')
fig.show()
In [31]:
# Add histogram data
hist_data = []
group_labels = []
for v in X.loc[val_weekly_price_ix, 'neighbourhood_group_cleansed'].unique():
    x = X.loc[(X['neighbourhood_group_cleansed'] == v) &
              (val_weekly_price_ix), 'weekly_price'].dropna()
    hist_data += [x]
    group_labels += [v]

# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, group_labels, bin_size=100)
fig.update_layout(title='qualified weekly price distribution')
fig.show()
In [32]:
# Add histogram data
hist_data = []
group_labels = []
for v in X.loc[val_monthly_price_ix, 'neighbourhood_group_cleansed'].unique():
    if v != "Staten Island":
        x = X.loc[(X['neighbourhood_group_cleansed'] == v) &
                  (val_monthly_price_ix), 'monthly_price'].dropna()
        hist_data += [x]
        group_labels += [v]

# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, group_labels, bin_size=200)
fig.update_layout(title='qualified monthly price distribution')
fig.show()

Summary: After removing outliers and zeros, the distributions of the price-related variables look much better.

4.5 Build Square Feet Imputation Pipeline with Qualified Price Variables & Neighborhood Group Variable

In [33]:
class FeatureCleaning(BaseEstimator, TransformerMixin):
    def __init__(self, price_ix, weekly_price_ix, monthly_price_ix,
                 zipcode_ix):
        self.price_ix = price_ix
        self.weekly_price_ix = weekly_price_ix
        self.monthly_price_ix = monthly_price_ix
        self.zipcode_ix = zipcode_ix

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        for ix in [self.price_ix, self.weekly_price_ix, self.monthly_price_ix]:
            X.iloc[:, ix] = X.iloc[:, ix].str.replace(
                "$", "", regex=False).str.replace(",", "", regex=False).astype(float)

        X.iloc[:, self.zipcode_ix] = X.iloc[:, self.zipcode_ix].str[:5]
        X = X.loc[~X.iloc[:, self.zipcode_ix].isnull(), :]
        return X


class SquareFeetImputation(BaseEstimator, TransformerMixin):
    def __init__(self, lb, ub):
        self.lb = lb
        self.ub = ub

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Get qualified price index
        val_price_ix = get_val_prices_ix(X, price_ix,
                                         neighbourhood_group_cleansed_ix,
                                         self.lb, self.ub)
        val_weekly_price_ix = get_val_prices_ix(
            X, weekly_price_ix, neighbourhood_group_cleansed_ix, self.lb,
            self.ub)
        val_monthly_price_ix = get_val_prices_ix(
            X, monthly_price_ix, neighbourhood_group_cleansed_ix, self.lb,
            self.ub)
        # build a dictionary mapping each price variable to its qualified index
        val_col_row_pair = {
            price_ix: val_price_ix,
            weekly_price_ix: val_weekly_price_ix,
            monthly_price_ix: val_monthly_price_ix
        }
        # for each price variable (price, weekly price, monthly price)
        for col_ix in val_col_row_pair.keys():
            # each iteration will fill some square feet with priority: price>weekly>monthly
            val_row = val_col_row_pair[col_ix]
            # among validated price, weekly price, monthly_price rows, get no missing square_feet data
            sqrf_no_missing = X[(~X.iloc[:, square_feet_ix].isnull())
                                & (val_row) & (X.iloc[:, square_feet_ix] != 0)]
            # aggregate by neighbourhood, sum qualified records square feet and price
            agg_df = sqrf_no_missing.groupby(
                sqrf_no_missing.columns[neighbourhood_group_cleansed_ix]).agg({
                    sqrf_no_missing.columns[square_feet_ix]:
                    'sum',
                    sqrf_no_missing.columns[col_ix]:
                    'sum'
                }).reset_index()
            # calculate price per square feet
            agg_df['price_per_sqrf'] = agg_df[
                sqrf_no_missing.columns[col_ix]] / agg_df['square_feet']
            # update square feet missing index
            sqrf_missing_ix = (X.iloc[:, square_feet_ix].isnull()) | (
                X.iloc[:, square_feet_ix] == 0)  # missing or zero records
            # group by each neighbourhood
            for neighborhood in agg_df.neighbourhood_group_cleansed:
                X.loc[(X.neighbourhood_group_cleansed == neighborhood)
                      & (sqrf_missing_ix) &
                      (val_row), X.columns[square_feet_ix]] = X.loc[
                          (X.neighbourhood_group_cleansed == neighborhood)
                          & (sqrf_missing_ix)
                          & (val_row), X.columns[col_ix]] / agg_df.loc[
                              agg_df.neighbourhood_group_cleansed ==
                              neighborhood, 'price_per_sqrf'].values

        return X
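The core imputation rule in the class above can be illustrated on toy data (the values and the toy frame are ours): within a neighbourhood, price per square foot is assumed constant, so a missing square_feet is estimated as price divided by the ratio-of-sums rate from the rows where square_feet is known.

```python
import pandas as pd

# Toy illustration of the ratio-of-sums imputation rule.
df = pd.DataFrame({
    "neighbourhood_group_cleansed": ["A", "A", "A"],
    "price": [100.0, 200.0, 150.0],
    "square_feet": [500.0, 1000.0, None],
})

# price per square foot per neighbourhood, from rows with known square_feet
known = df[df["square_feet"].notna()]
sums = known.groupby("neighbourhood_group_cleansed")[["price", "square_feet"]].sum()
rate = sums["price"] / sums["square_feet"]

# fill missing square_feet as price / rate of its neighbourhood
missing = df["square_feet"].isna()
df.loc[missing, "square_feet"] = (
    df.loc[missing, "price"]
    / df.loc[missing, "neighbourhood_group_cleansed"].map(rate))

print(df.loc[2, "square_feet"])  # 150 / (300 / 1500) = 750.0
```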
In [34]:
feature_pipeline = Pipeline([
    ('feature_clean',
     FeatureCleaning(price_ix, weekly_price_ix, monthly_price_ix, zipcode_ix)),
    ('limit_bedrooms',
     FunctionTransformer(limit_bedrooms, validate=False, kw_args={'n': 2})),
    ('sqrf_impute', SquareFeetImputation(lb=.1, ub=.9)),
    ('feature_selector', FunctionTransformer(extract_feature, validate=False))
])
In [35]:
airbnb_prepared_final = feature_pipeline.fit_transform(airbnb.data.copy())

Let's check how many records have none of the three price variables qualified, which leaves the "square_feet" variable missing.

In [36]:
sqrf_missing = airbnb_prepared_final.square_feet.isnull().sum()
sqrf_missing
Out[36]:
1264

There are still some square_feet missing. Let's take a look at their distribution.

In [37]:
nyc_map_plot(airbnb_prepared_final[airbnb_prepared_final.square_feet.isnull()],
             size="price",
             color="zipcode")

Summary: The missing records seem randomly distributed on the map, so the missingness shouldn't be related to zipcode.

It's safe to drop these records, so let's rewrite our SquareFeetImputation class.

Instead of storing three boolean price-qualification columns to distinguish the possible qualification combinations, the following code packs them into a single integer column.

X['collapses_price_val'] = (
    val_price_ix.values.astype(int)
    + (val_weekly_price_ix.values.astype(int) << 1)
    + (val_monthly_price_ix.values.astype(int) << 2)
)

Situation table:

| monthly_qualified | weekly_qualified | daily_qualified | binary_value | stored_value |
| --- | --- | --- | --- | --- |
| False | False | False | 000 | 0 |
| False | False | True | 001 | 1 |
| False | True | False | 010 | 2 |
| False | True | True | 011 | 3 |
| True | False | False | 100 | 4 |
| True | False | True | 101 | 5 |
| True | True | False | 110 | 6 |
| True | True | True | 111 | 7 |
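The encoding above puts the daily flag in bit 0, the weekly flag in bit 1, and the monthly flag in bit 2. A minimal sketch of the scalar encode/decode (function names are ours):

```python
# Pack three qualification flags into one integer, matching the table:
# daily -> bit 0, weekly -> bit 1, monthly -> bit 2.
def encode(daily: bool, weekly: bool, monthly: bool) -> int:
    return int(daily) + (int(weekly) << 1) + (int(monthly) << 2)


def decode(value: int) -> tuple:
    """Return the (daily, weekly, monthly) qualification flags."""
    return bool(value & 1), bool(value & 2), bool(value & 4)


print(encode(True, True, False))  # 3: daily and weekly qualified
print(decode(5))                  # (True, False, True)
```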
In [38]:
class SquareFeetImputation(BaseEstimator, TransformerMixin):
    def __init__(self, lb, ub):
        self.lb = lb
        self.ub = ub

    def fit(self, X, y=None):
        return self

    def transform(self, X):

        # Find index of the outliers regarding the price variables
        val_price_ix = get_val_prices_ix(X, price_ix,
                                         neighbourhood_group_cleansed_ix,
                                         self.lb, self.ub)
        val_weekly_price_ix = get_val_prices_ix(
            X, weekly_price_ix, neighbourhood_group_cleansed_ix, self.lb,
            self.ub)
        val_monthly_price_ix = get_val_prices_ix(
            X, monthly_price_ix, neighbourhood_group_cleansed_ix, self.lb,
            self.ub)
        # mark price variables are outlier or not.
        # use integer to store multiple boolean columns
        X['collapses_price_val'] = val_price_ix.values.astype(int) + (
            val_weekly_price_ix.values.astype(int) << 1) + (
                val_monthly_price_ix.values.astype(int) << 2)

        val_col_row_pair = {
            price_ix: val_price_ix,
            weekly_price_ix: val_weekly_price_ix,
            monthly_price_ix: val_monthly_price_ix
        }

        for col_ix in val_col_row_pair.keys():
            val_row = val_col_row_pair[col_ix]
            # among validated price, weekly price, monthly_price rows, get no missing square_feet data
            sqrf_no_missing = X[(~X.iloc[:, square_feet_ix].isnull())
                                & (val_row) & (X.iloc[:, square_feet_ix] != 0)]
            """
            group by neighbourhood_group_cleansed, sum square feet and one of the price variables 
            where this selected price is qualified
            """
            agg_df = sqrf_no_missing.groupby(
                sqrf_no_missing.columns[neighbourhood_group_cleansed_ix]).agg({
                    sqrf_no_missing.columns[square_feet_ix]:
                    'sum',
                    sqrf_no_missing.columns[col_ix]:
                    'sum'
                }).reset_index()

            agg_df['price_per_sqrf'] = agg_df[
                sqrf_no_missing.columns[col_ix]] / agg_df['square_feet']

            sqrf_missing_ix = (X.iloc[:, square_feet_ix].isnull()) | (
                X.iloc[:, square_feet_ix] == 0)
            """for each neighborhood using average price_per_sqrt is constant assumption
            calculating square feet with price > weekly_price > monthly_price"""

            for neighborhood in agg_df.neighbourhood_group_cleansed:
                X.loc[(X.neighbourhood_group_cleansed == neighborhood)
                      & (sqrf_missing_ix) &
                      (val_row), X.columns[square_feet_ix]] = X.loc[
                          (X.neighbourhood_group_cleansed == neighborhood)
                          & (sqrf_missing_ix)
                          & (val_row), X.columns[col_ix]] / agg_df.loc[
                              agg_df.neighbourhood_group_cleansed ==
                              neighborhood, 'price_per_sqrf'].values

        # drop rows still missing square_feet, per the map plot above
        return X[~X.iloc[:, square_feet_ix].isnull()]
In [39]:
feature_pipeline = Pipeline([
    ('feature_clean',
     FeatureCleaning(price_ix, weekly_price_ix, monthly_price_ix, zipcode_ix)),
    ('limit_bedrooms',
     FunctionTransformer(limit_bedrooms, validate=False, kw_args={'n': 2})),
    ('sqrf_impute', SquareFeetImputation(lb=.1, ub=.9)),
])
In [40]:
airbnb_prepared_final = feature_pipeline.fit_transform(airbnb.data.copy())

5. Airbnb Revenue Calculation

Assumption: The airbnb customer ratio among single day, weekly and monthly is {'day' : 3,'week' : 1,'month' : 1}.

Assumption: If weekly price or monthly price are missing or zero, it means the house doesn't support weekly or monthly rent.

In [41]:
customer_ratio = {'day': 3, 'week': 1, 'month': 1}
In [42]:
X = airbnb_prepared_final
val_price_ix = X.iloc[:, price_ix] != 0
val_weekly_price_ix = (X.iloc[:, weekly_price_ix] !=
                       0) & (~X.iloc[:, weekly_price_ix].isnull())
val_monthly_price_ix = (X.iloc[:, monthly_price_ix] !=
                        0) & (~X.iloc[:, monthly_price_ix].isnull())
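The notebook does not spell out the revenue formula at this point. One plausible reading of the 3:1:1 customer-ratio assumption, sketched with a hypothetical helper of our own (not the report's exact calculation), is a blended per-night rate that weights the daily rate, weekly_price/7, and monthly_price/30 by the ratio, dropping modes the listing does not support:

```python
# Sketch only: blend the per-night rates of the supported rental modes
# using the assumed customer ratio {'day': 3, 'week': 1, 'month': 1}.
# A missing/zero price means the mode is unsupported (per the assumption).
def blended_daily_rate(price, weekly_price, monthly_price, ratio=None):
    ratio = ratio or {'day': 3, 'week': 1, 'month': 1}
    rates, weights = [], []
    if price:
        rates.append(price)
        weights.append(ratio['day'])
    if weekly_price:
        rates.append(weekly_price / 7)
        weights.append(ratio['week'])
    if monthly_price:
        rates.append(monthly_price / 30)
        weights.append(ratio['month'])
    return sum(r * w for r, w in zip(rates, weights)) / sum(weights)


print(blended_daily_rate(100, 630, 2400))  # (3*100 + 1*90 + 1*80) / 5 = 94.0
```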

Assumption: Qualified availability variables must be non-zero, non-missing, and within the 10%–90% quantile range.

In [43]:
def get_val_avail_ix(X, avail_ix, lb=0.1, ub=0.9):
    no_missing_or_zero_ix = (X.iloc[:, avail_ix] !=
                             0) & (~X.iloc[:, avail_ix].isnull())
    return no_missing_or_zero_ix & (X.iloc[:, avail_ix].between(
        X.loc[no_missing_or_zero_ix, X.columns[avail_ix]].quantile(lb),
        X.loc[no_missing_or_zero_ix, X.columns[avail_ix]].quantile(ub)))


val_avail_30 = get_val_avail_ix(X, availability_30_ix)
val_avail_60 = get_val_avail_ix(X, availability_60_ix)
val_avail_90 = get_val_avail_ix(X, availability_90_ix)
val_avail_365 = get_val_avail_ix(X, availability_365_ix)
In [44]:
val_avail_365.sum(), val_avail_90.sum(), val_avail_60.sum(), val_avail_30.sum()
Out[44]:
(2909, 2786, 2702, 2281)

Summary: Of the four availability variables, availability_365 has the largest number of qualified records.

Assumption: Availability days are uniformly distributed within the 365 days, so we can use $\frac{availability\_90}{90} \times 365$, $\frac{availability\_60}{60} \times 365$, or $\frac{availability\_30}{30} \times 365$ to estimate availability_365 when availability_365 is not qualified.

Assumption: availability_365 calculation priority: $availability\_365 > availability\_90 > availability\_60 > availability\_30$
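
As a toy illustration of the uniform-distribution assumption (numbers made up), a listing available 45 out of the last 90 days scales linearly to an estimated yearly availability:

```python
# Toy example (made-up numbers) of the scaling assumption: availability is
# assumed uniformly distributed over the year, so a 90-day window scales
# linearly to a 365-day estimate.
availability_90 = 45
est_availability_365 = availability_90 / 90 * 365
print(est_availability_365)  # 182.5
```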

Before we start calculating "availability_365", here are their distributions.

In [45]:
# Add histogram data
hist_data = []
group_labels = []
val_avail = [val_avail_365, val_avail_90, val_avail_60, val_avail_30]
avail_ix = [availability_365_ix, availability_90_ix,
            availability_60_ix, availability_30_ix]
In [46]:
hist_data = [
    X.loc[val_avail_365, X.columns[availability_365_ix]],
    X.loc[val_avail_90, X.columns[availability_90_ix]],
    X.loc[val_avail_60, X.columns[availability_60_ix]],
    X.loc[val_avail_30, X.columns[availability_30_ix]]
]
group_labels = [
    X.columns[availability_365_ix], X.columns[availability_90_ix],
    X.columns[availability_60_ix], X.columns[availability_30_ix]
]
# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, group_labels, bin_size=[10, 10, 7, 1])
fig.update_layout(title='qualified availabilities distribution')
fig.show()
In [47]:
fig = go.Figure()
for i in range(4):
    fig.add_trace(go.Box(x=hist_data[i], name=group_labels[i]))
fig.show()

Now from the plots, we can see that it is appropriate to use availability_365 to calculate the yearly revenue.

Let's take a look at the availability variables (90 days, 60 days, 30 days), when availability_365 is not qualified.

In [48]:
t = X[~val_avail_365]  # availability_365 not qualified dataset
t.availability_30.hist()
plt.title("availability_30 distribution when availability_365 not qualified")
plt.show()
t.availability_60.hist()
plt.title("availability_60 distribution when availability_365 not qualified")
plt.show()
t.availability_90.hist()
plt.title("availability_90 distribution when availability_365 not qualified")
plt.show()
X[val_avail_365].availability_365.hist()
plt.title("availability_365 distribution when availability_365 qualified")
plt.show()
In [49]:
nyc_map_plot(t, color='zipcode')

Summary: There is no specific pattern among the records whose availability_365 is not qualified, so we can safely estimate availability_365 for them.

5.2 Build Pipeline to Calculate Availability 365

In [50]:
class CalAvail365(BaseEstimator, TransformerMixin):
    def __init__(self, lb, ub):
        self.lb = lb
        self.ub = ub

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # get qualified availability index
        val_avail_30 = get_val_avail_ix(X, availability_30_ix, self.lb,
                                        self.ub)
        val_avail_60 = get_val_avail_ix(X, availability_60_ix, self.lb,
                                        self.ub)
        val_avail_90 = get_val_avail_ix(X, availability_90_ix, self.lb,
                                        self.ub)
        val_avail_365 = get_val_avail_ix(X, availability_365_ix, self.lb,
                                         self.ub)
        # availability_365 not qualified, but availability_90 qualified
        X.loc[(~val_avail_365) &
              (val_avail_90), X.columns[availability_365_ix]] = X.loc[
                  (~val_avail_365) &
                  (val_avail_90), X.columns[availability_90_ix]] / 90 * 365
        # availability_365 and availability_90 not qualified, but availability_60 qualified
        X.loc[(~val_avail_365) & (~val_avail_90) &
              (val_avail_60), X.columns[availability_365_ix]] = X.loc[
                  (~val_avail_365) & (~val_avail_90) &
                  (val_avail_60), X.columns[availability_60_ix]] / 60 * 365
        # availability_365, availability_90 and availability_60 not qualified, but availability_30 qualified
        X.loc[(~val_avail_365) & (~val_avail_90) & (~val_avail_60) &
              (val_avail_30), X.columns[availability_365_ix]] = X.loc[
                  (~val_avail_365) & (~val_avail_90) & (~val_avail_60) &
                  (val_avail_30), X.columns[availability_30_ix]] / 30 * 365
        return X[get_val_avail_ix(X, availability_365_ix, self.lb, self.ub)]
In [51]:
ca365 = CalAvail365(0.1, 0.9)
X_tmp = ca365.fit_transform(X)
X_tmp.availability_365.hist()
plt.show()
X_tmp.shape
Out[51]:
(2899, 107)

After imputing from the other availability variables, the distribution of availability_365 is even better.

Now we can build a separate pipeline for generating the estimated revenue.

With the customer_ratio assumption ({'day' : 3, 'week' : 1, 'month' : 1}), we can calculate the estimated annual revenue using the formula shown below.

$$ customer\_weight\_matrix = W = \left(\begin{array}{ccc} 0 & 0 & 0\\ 0 & 0 & 1\\ 0 & 1 & 0\\ 0 & \frac{1}{4} & \frac{3}{4}\\ 1 & 0 & 0\\ \frac{1}{4} & 0 & \frac{3}{4}\\ \frac{1}{2} & \frac{1}{2} & 0\\ \frac{1}{5} & \frac{1}{5} & \frac{3}{5}\\ \end{array}\right) $$

The columns of $W$ weight the monthly, weekly and daily prices respectively, and row $i$ applies to the records with collapses_price_val == i. For $i$ from 0 to 7:

$$ est\_yearly\_revenue[collapses\_price\_val==i] = \left( monthly\_price \times W_{i,0} + weekly\_price \times W_{i,1} + daily\_price \times W_{i,2} \right) \times availability\_365 \times occupancy\_rate $$
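
The weight matrix rows can also be generated programmatically from customer_ratio. The sketch below assumes that collapses_price_val encodes which prices are valid as a 3-bit pattern (4 = monthly, 2 = weekly, 1 = daily), which is consistent with the matrix rows above; `build_weight_matrix` is a hypothetical helper, not part of the original pipeline.

```python
import numpy as np

# Sketch (assumption): treat collapses_price_val as a 3-bit pattern of which
# prices are valid (4 = monthly, 2 = weekly, 1 = daily); the weight matrix
# columns are [monthly, weekly, daily]. Hypothetical helper, for illustration.
def build_weight_matrix(customer_ratio):
    ratios = np.array([customer_ratio['month'],
                       customer_ratio['week'],
                       customer_ratio['day']], dtype=float)
    W = np.zeros((8, 3))
    for i in range(1, 8):
        # mask of the valid price types for this bit pattern
        mask = np.array([(i >> 2) & 1, (i >> 1) & 1, i & 1], dtype=float)
        # normalize the customer ratio over the valid price types
        W[i] = ratios * mask / (ratios * mask).sum()
    return W

W = build_weight_matrix({'day': 3, 'week': 1, 'month': 1})
# W[7] (all three valid)  -> [0.2, 0.2, 0.6]
# W[3] (weekly + daily)   -> [0.0, 0.25, 0.75]
```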

In [52]:
def yearly_revenue_generate(X, customer_ratio, occupancy_rate):
    # build customer weights matrix
    customer_weights_matrix = np.array(
        [[0, 0, 0], [0, 0, 1], [0, 1, 0],
         [
             0, customer_ratio['week'] /
             (customer_ratio['week'] + customer_ratio['day']),
             customer_ratio['day'] /
             (customer_ratio['week'] + customer_ratio['day'])
         ], [1, 0, 0],
         [
             customer_ratio['month'] /
             (customer_ratio['month'] + customer_ratio['day']), 0,
             customer_ratio['day'] /
             (customer_ratio['month'] + customer_ratio['day'])
         ],
         [
             customer_ratio['month'] /
             (customer_ratio['month'] + customer_ratio['week']),
             customer_ratio['week'] /
             (customer_ratio['month'] + customer_ratio['week']), 0
         ],
         [
             customer_ratio['month'] /
             (customer_ratio['month'] + customer_ratio['week'] +
              customer_ratio['day']), customer_ratio['week'] /
             (customer_ratio['month'] + customer_ratio['week'] +
              customer_ratio['day']), customer_ratio['day'] /
             (customer_ratio['month'] + customer_ratio['week'] +
              customer_ratio['day'])
         ]])
    # use np.dot to compute the dot product between the price matrix and the weights matrix,
    # then multiply by availability_365 and the occupancy rate
    for i in X['collapses_price_val'].drop_duplicates():
        rix = X['collapses_price_val'] == i
        cix = [monthly_price_ix, weekly_price_ix, price_ix]
        X.loc[rix, 'est_yearly_revenue'] = np.dot(
            np.nan_to_num(X.loc[rix, X.columns[cix]].values),
            customer_weights_matrix[i, :]
        ) * X.loc[rix, X.columns[availability_365_ix]].values * occupancy_rate

    return X[X['collapses_price_val'] != 0]
In [53]:
revenue_pipeline = Pipeline([('avail_cal', CalAvail365(0.1, 0.9)),
                             ('gen_revenue',
                              FunctionTransformer(yearly_revenue_generate,
                                                  validate=False,
                                                  kw_args={
                                                      'customer_ratio': {
                                                          'day': 3,
                                                          'week': 1,
                                                          'month': 1
                                                      },
                                                      'occupancy_rate': 0.75
                                                  }))])
In [54]:
airbnb_revenue = revenue_pipeline.fit_transform(airbnb_prepared_final.copy())

6. Zillow Data Insight

In [55]:
time_col = [x for x in zillow_data.columns if re.search(r'\d{4}-\d{2}', x)]
In [56]:
zillow_data = zillow_data[["RegionName"] + time_col]
zillow_data.index = zillow_data.RegionName

Let's take a look at the missing information.

In [57]:
zillow_data_cleaned = zillow_data.drop("RegionName", axis=1)
In [58]:
zillow_df = zillow_data_cleaned.T
In [59]:
zillow_df_missing = missingAnalysis(zillow_df)
In [60]:
zillow_cleaned_missing = missingAnalysis(zillow_data_cleaned)

Let's also take a look at the time series data

In [61]:
fig = go.Figure()

for c in zillow_df.columns:
    fig.add_trace(go.Scatter(x=zillow_df.index, y=zillow_df[c], name=c))

fig.update_layout(title_text='Zillow Median Home Value',
                  xaxis_rangeslider_visible=True)
fig.show()

Summary: There is a relatively steady growth in the past 5 years (2012-06 to 2017-06).

Assumption: Use the average of the past 5 years' (2012-06 to 2017-06) month-over-month percentage change rates to predict the current median home value.

In [62]:
zillow_pct_change = zillow_df.pct_change()[zillow_df.index > "2012-06"]

The months gap between the last observed date and now:

In [63]:
last_date = zillow_pct_change.index[-1]
In [64]:
month_gap = (pd.Timestamp(datetime.datetime.now()).to_period('M') -
             pd.Timestamp(last_date).to_period("M"))

Now we can estimate the price percentage change that has happened since the last observed date.

Assumption: The average month-over-month growth rate from the last observed date to now equals the past 5 years' average month-over-month growth rate.
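
As a toy sketch of this extrapolation (all numbers made up), we compound the average month-over-month rate over the month gap:

```python
import pandas as pd

# Toy sketch (made-up numbers): extrapolate the last observed value by
# compounding the average month-over-month growth rate over the month gap.
values = pd.Series([100.0, 101.0, 102.0, 103.5],
                   index=["2017-03", "2017-04", "2017-05", "2017-06"])
avg_rate = values.pct_change().mean()   # average monthly growth rate
month_gap = 24                          # assumed months between 2017-06 and "now"
current_value = values.iloc[-1] * (1 + avg_rate) ** month_gap
```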

In [65]:
zillow_pct_now = ((zillow_pct_change.mean() + 1)**int(
    month_gap.freqstr[:-1])).reset_index()
zillow_pct_now.columns = ['RegionName', 'est_pct']
In [66]:
zillow_value_now = zillow_data_cleaned.merge(zillow_pct_now,
                                             on='RegionName',
                                             how='left')
In [67]:
zillow_value_now['current_value'] = zillow_value_now[
    last_date] * zillow_value_now['est_pct']

6.1 Build Pipeline to Calculate Current Median Home Value

Now we can build a pipeline for calculating the current median value.

In [68]:
def current_median(X, window_start):
    time_col = [x for x in X.columns if re.search(r'\d{4}-\d{2}', x)]
    X = X[["RegionName"] + time_col]
    X.index = X.RegionName
    zillow_data_cleaned = X.drop("RegionName", axis=1)
    zillow_df = zillow_data_cleaned.T
    zillow_pct_change = zillow_df.pct_change()[zillow_df.index > window_start]
    last_date = zillow_pct_change.index[-1]
    month_gap = (pd.Timestamp(datetime.datetime.now()).to_period('M') -
                 pd.Timestamp(last_date).to_period("M"))
    zillow_pct_now = ((zillow_pct_change.mean() + 1)**int(
        month_gap.freqstr[:-1])).reset_index()
    zillow_pct_now.columns = ['RegionName', 'est_pct']
    zillow_value_now = zillow_data_cleaned.merge(zillow_pct_now,
                                                 on='RegionName',
                                                 how='left')
    zillow_value_now['current_value'] = zillow_value_now[
        last_date] * zillow_value_now['est_pct']
    zillow_value_now = zillow_value_now[["RegionName", "current_value"]]
    zillow_value_now.columns = ["zipcode", 'current_value']
    return zillow_value_now
In [69]:
current_value_pipeline = Pipeline([
    ("current_value",
     FunctionTransformer(current_median,
                         validate=False,
                         kw_args={'window_start': "2012-06"}))
])
In [70]:
cm = FunctionTransformer(current_median,
                         validate=False,
                         kw_args={'window_start': "2012-06"})
In [71]:
zillow_cost = cm.fit_transform(zillow_data.copy())

7. Profit Calculation

Now we have the median home value for each of the zipcodes covered by the Zillow data. However, some zipcodes still don't have a median home value.

Since we hold the strong assumption that "All properties and all square feet within each locale can be assumed to be homogeneous", we can use the median home values available within each neighbourhood, together with the square_feet variable, to estimate the median home value for the zipcodes in the same neighbourhood whose value cannot be obtained directly from the Zillow data.

Then, using the assumption that the value per square foot is constant within each neighbourhood_group, we can multiply it by square_feet to calculate the estimated cost for each house.

Assumption: Here we assume the estimated yearly revenue is the annual profit.
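
A minimal sketch (made-up numbers) of the neighbourhood-level fallback described above:

```python
import pandas as pd

# Toy sketch (made-up numbers): cost per square foot is assumed constant
# within a neighbourhood, so a listing whose zipcode has no Zillow value
# inherits the neighbourhood-average cost per square foot.
df = pd.DataFrame({
    'neighbourhood': ['Queens', 'Queens', 'Queens'],
    'square_feet':   [500.0, 800.0, 600.0],
    'current_value': [600000.0, 900000.0, None],  # third zipcode missing in Zillow
})
known = df[df.current_value.notnull()]
avg_cost_per_sqft = known.current_value.sum() / known.square_feet.sum()
df['est_cost'] = avg_cost_per_sqft * df.square_feet  # estimated cost per listing
```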

In [72]:
X = airbnb_revenue
Y = zillow_cost
X.zipcode = X.zipcode.astype("int")
revenue_cost = X.merge(Y, how="left", on="zipcode")
In [73]:
current_val_filled = (~revenue_cost.current_value.isnull()).sum()
current_val_filled
Out[73]:
668

Summary: Only 668 records' current median values have been filled directly. Let's check whether every neighbourhood is covered, so we can estimate the remaining zipcodes' median home values.

In [74]:
revenue_cost.loc[~revenue_cost.current_value.isnull(
), "neighbourhood_group_cleansed"].drop_duplicates()
Out[74]:
7       Brooklyn
11     Manhattan
793       Queens
Name: neighbourhood_group_cleansed, dtype: object

Let's estimate the other zipcodes' current values.

In [75]:
agg = revenue_cost[~revenue_cost.current_value.isnull()].groupby(
    revenue_cost.columns[neighbourhood_group_cleansed_ix]).agg({
        'current_value':
        'sum',
        revenue_cost.columns[square_feet_ix]:
        'sum'
    }).reset_index()
In [76]:
agg['avg_cost_per_square_feet'] = agg.iloc[:, 1] / agg.iloc[:, 2]
agg = agg[[
    revenue_cost.columns[neighbourhood_group_cleansed_ix],
    'avg_cost_per_square_feet'
]]
In [77]:
revenue_cost = revenue_cost.merge(
    agg, on=revenue_cost.columns[neighbourhood_group_cleansed_ix])
In [78]:
revenue_cost['est_cost'] = revenue_cost[
    'avg_cost_per_square_feet'] * revenue_cost.iloc[:, square_feet_ix]

Evaluation Metrics Selected:

* Break Even Year
* Annual Return on Investment (Annual ROI)
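
The two metrics are reciprocal views of the same cost/revenue ratio (annual_roi = 100 / break_even_year); a toy example with made-up numbers:

```python
# Toy example (made-up numbers): break-even years and annual ROI are
# reciprocals of each other, up to the factor of 100.
est_cost = 1_000_000.0
est_yearly_revenue = 50_000.0
break_even_year = est_cost / est_yearly_revenue      # 20.0 years
annual_roi = est_yearly_revenue / est_cost * 100     # 5.0 percent
assert abs(annual_roi - 100 / break_even_year) < 1e-9
```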

7.1.1 Break Even Year

Assumption: The annual profit is constant over the break-even years.

In [79]:
revenue_cost['break_even_year'] = revenue_cost['est_cost'] / revenue_cost[
    'est_yearly_revenue']
In [80]:
final_break_even = revenue_cost.groupby('zipcode')[[
    'break_even_year'
]].median().reset_index().sort_values(by="break_even_year")
In [81]:
final_break_even
Out[81]:
zipcode break_even_year
42 10282 10.723134
93 11367 14.550919
103 11379 14.654854
99 11374 16.030373
106 11412 16.406496
116 11426 20.204995
102 11378 20.313659
101 11377 21.656617
48 11105 21.710894
46 11103 21.799666
100 11375 21.943096
96 11370 22.012463
115 11423 22.796439
90 11358 23.052579
88 11355 23.052579
123 11436 23.314540
94 11368 23.490963
114 11422 23.996252
92 11365 24.911426
107 11413 25.329377
45 11102 25.329377
95 11369 25.414949
112 11419 25.487685
98 11373 26.136045
97 11372 26.136045
44 11101 28.018279
49 11106 28.495549
108 11414 28.495549
109 11416 28.804368
113 11421 30.177841
126 11692 30.515971
128 11694 31.717466
127 11693 32.207478
111 11418 33.634090
89 11356 34.938526
120 11433 35.071445
47 11104 37.645496
105 11411 38.306312
75 11228 40.219167
122 11435 41.509217
81 11234 41.824101
37 10065 44.731964
34 10039 45.349745
4 10005 45.399606
121 11434 45.791602
104 11385 45.990325
5 10007 46.977198
125 11691 47.165046
29 10034 47.494698
118 11429 47.515469
83 11236 48.273157
18 10023 48.388798
43 11003 49.438061
19 10024 49.661609
22 10027 50.071195
80 11233 51.332358
52 11204 51.621571
38 10075 51.774869
64 11216 52.053805
17 10022 52.633429
6 10009 53.131416
35 10040 53.364449
40 10162 53.583515
13 10017 54.311673
66 11218 55.732274
63 11215 55.998936
23 10028 56.069559
12 10016 56.329140
78 11231 56.814454
20 10025 57.259768
39 10128 57.694336
27 10032 57.938544
110 11417 59.468972
2 10003 60.128773
15 10019 60.234616
33 10038 60.538801
55 11207 60.957175
21 10026 61.298686
60 11212 61.924749
54 11206 61.926483
25 10030 62.927697
87 11354 63.671124
1 10002 63.839486
14 10018 64.376160
119 11432 64.546361
86 11249 65.020986
61 11213 65.202609
31 10036 65.243332
53 11205 65.567896
69 11221 65.751559
73 11225 66.391337
70 11222 66.498736
16 10021 66.870336
117 11428 67.452477
59 11211 67.651893
9 10012 67.974715
24 10029 68.740646
7 10010 70.330025
51 11203 70.504684
50 11201 71.364497
57 11209 72.245541
32 10037 72.392493
82 11235 73.148610
85 11238 73.841603
74 11226 74.846404
10 10013 74.880807
84 11237 77.269434
8 10011 77.497416
56 11208 78.025184
26 10031 78.254657
72 11224 83.598411
71 11223 84.542051
124 11559 87.305511
0 10001 89.541387
62 11214 90.388788
30 10035 95.055424
79 11232 95.152663
11 10014 96.018205
91 11361 97.699025
65 11217 98.147468
28 10033 100.554498
76 11229 102.266790
58 11210 104.498014
68 11220 110.035516
77 11230 141.009368
67 11219 160.325720
36 10044 166.672525
3 10004 178.927858
41 10280 238.570477
In [82]:
nyc_map_plot(revenue_cost, color="break_even_year")

Let's scale it and plot again.

In [83]:
scaler = MinMaxScaler(copy=False)
revenue_cost['scaled_break_even'] = scaler.fit_transform(
    revenue_cost['break_even_year'].values.reshape(-1, 1))
In [84]:
nyc_map_plot(revenue_cost, color="scaled_break_even")

7.1.2 ROI

In [85]:
revenue_cost['annual_roi'] = revenue_cost['est_yearly_revenue'] / revenue_cost[
    'est_cost'] * 100
final_roi = revenue_cost.groupby('zipcode')[[
    'annual_roi'
]].median().reset_index().sort_values(by="annual_roi", ascending=False)
final_roi
Out[85]:
zipcode annual_roi
42 10282 9.325632
93 11367 6.872418
103 11379 6.823678
99 11374 6.255038
106 11412 6.119648
116 11426 5.361461
96 11370 5.004030
95 11369 4.959352
102 11378 4.922796
94 11368 4.659462
46 11103 4.654723
101 11377 4.630353
48 11105 4.605983
113 11421 4.593120
100 11375 4.557242
115 11423 4.386650
90 11358 4.337909
88 11355 4.337909
123 11436 4.289169
92 11365 4.276984
114 11422 4.167318
122 11435 3.953401
128 11694 3.953401
89 11356 3.952724
45 11102 3.947985
107 11413 3.947985
112 11419 3.923615
98 11373 3.826134
97 11372 3.826134
44 11101 3.582540
108 11414 3.509320
49 11106 3.509320
109 11416 3.472765
126 11692 3.399654
127 11693 3.107210
111 11418 2.973174
105 11411 2.851323
120 11433 2.851323
47 11104 2.656360
75 11228 2.486377
81 11234 2.390966
119 11432 2.285391
37 10065 2.235538
34 10039 2.231428
4 10005 2.202662
121 11434 2.183848
104 11385 2.174370
5 10007 2.128692
29 10034 2.124583
125 11691 2.120214
118 11429 2.108029
83 11236 2.093341
18 10023 2.066594
43 11003 2.022733
19 10024 2.013628
22 10027 1.997190
80 11233 1.948089
52 11204 1.937175
38 10075 1.931439
64 11216 1.922456
17 10022 1.899933
40 10162 1.883952
6 10009 1.882126
35 10040 1.873907
13 10017 1.841259
66 11218 1.794292
63 11215 1.785748
23 10028 1.783499
12 10016 1.775280
78 11231 1.760115
20 10025 1.746514
39 10128 1.733272
27 10032 1.725967
110 11417 1.681549
2 10003 1.664325
15 10019 1.660216
33 10038 1.651997
55 11207 1.640496
21 10026 1.631450
60 11212 1.614863
54 11206 1.614863
25 10030 1.606793
87 11354 1.584068
1 10002 1.566726
14 10018 1.553370
86 11249 1.537965
61 11213 1.533693
31 10036 1.532823
53 11205 1.525149
69 11221 1.520876
73 11225 1.509892
70 11222 1.503788
16 10021 1.495838
117 11428 1.482525
59 11211 1.478155
9 10012 1.471181
24 10029 1.454743
7 10010 1.421868
51 11203 1.418345
50 11201 1.401257
32 10037 1.384883
57 11209 1.384168
82 11235 1.367080
85 11238 1.354264
74 11226 1.337175
10 10013 1.335569
84 11237 1.298726
8 10011 1.290366
26 10031 1.282147
56 11208 1.281637
72 11224 1.196195
71 11223 1.184824
124 11559 1.145403
0 10001 1.117769
62 11214 1.106480
76 11229 1.097936
30 10035 1.052018
79 11232 1.050943
11 10014 1.043799
91 11361 1.023552
65 11217 1.021038
28 10033 0.994486
58 11210 0.956956
68 11220 0.939867
77 11230 0.709173
67 11219 0.623730
36 10044 0.599979
3 10004 0.558884
41 10280 0.419163
In [86]:
nyc_map_plot(revenue_cost, color="annual_roi")

Let's scale it and plot again

In [87]:
revenue_cost['scaled_annual_roi'] = scaler.fit_transform(
    revenue_cost['annual_roi'].values.reshape(-1, 1))
In [88]:
nyc_map_plot(revenue_cost, color="scaled_annual_roi")

7.2 Build Profit Calculation Pipeline

In [89]:
class CalProfit(BaseEstimator, TransformerMixin):
    # Note: this transformer keeps the airbnb and zillow data passed to
    # __init__ and ignores the X that the pipeline passes to fit/transform.
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        self.X.zipcode = self.X.zipcode.astype("int")
        # left join airbnb data and zillow data
        revenue_cost = self.X.merge(self.Y, how="left", on="zipcode")
        # group by neighbourhood, sum revenue and square feet
        agg = revenue_cost[~revenue_cost.current_value.isnull()].groupby(
            revenue_cost.columns[neighbourhood_group_cleansed_ix]).agg({
                'current_value':
                'sum',
                revenue_cost.columns[square_feet_ix]:
                'sum'
            }).reset_index()
        # calculate average cost per square feet
        agg['avg_cost_per_square_feet'] = agg.iloc[:, 1] / agg.iloc[:, 2]
        agg = agg[[
            revenue_cost.columns[neighbourhood_group_cleansed_ix],
            'avg_cost_per_square_feet'
        ]]
        # using the assumption that average cost per square feet is constant in each neighbourhood
        # to calculate estimated cost for each house
        revenue_cost = revenue_cost.merge(
            agg, on=revenue_cost.columns[neighbourhood_group_cleansed_ix])
        revenue_cost['est_cost'] = revenue_cost[
            'avg_cost_per_square_feet'] * revenue_cost.iloc[:, square_feet_ix]
        return revenue_cost
In [90]:
class CalEvaluationMatrix(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X['break_even_year'] = X['est_cost'] / X[
            'est_yearly_revenue']  #break even years calculation
        X['annual_roi'] = X['est_yearly_revenue'] / X[
            'est_cost'] * 100  #annual ROI calculation
        return X


def top_n_zipcode(X, variable: "['break_even_year', 'annual_roi']",
                  top_n) -> DataFrame:
    """
    Calculate the median of the selected variable per zipcode, then return the top n records.
    For break_even_year lower is better (sort ascending); for annual_roi
    higher is better (sort descending).
    """
    ascending = (variable == 'break_even_year')
    final = X.groupby('zipcode')[[
        variable
    ]].median().reset_index().sort_values(by=variable, ascending=ascending)
    return final.head(top_n)[["zipcode", variable]]
In [91]:
profit_pipeline = Pipeline([('cal_profit',
                             CalProfit(airbnb_revenue, zillow_cost)),
                            ('cal_roi_year', CalEvaluationMatrix()),
                            ('top_n_zipcode',
                             FunctionTransformer(top_n_zipcode,
                                                 validate=False,
                                                 kw_args={
                                                     'variable': 'annual_roi',
                                                     'top_n': 10
                                                 }))])

Result: Top 10 zip code by annual ROI

In [92]:
profit_pipeline.fit_transform(airbnb_revenue, zillow_cost)
Out[92]:
zipcode annual_roi
42 10282 9.325632
93 11367 6.872418
103 11379 6.823678
99 11374 6.255038
106 11412 6.119648
116 11426 5.361461
96 11370 5.004030
95 11369 4.959352
102 11378 4.922796
94 11368 4.659462

Result: Top 20 zip code by number of break even years

In [93]:
profit_pipeline.set_params(top_n_zipcode__kw_args={
    'variable': 'break_even_year',
    'top_n': 20
}).fit_transform(airbnb_revenue, zillow_cost)
Out[93]:
zipcode break_even_year
42 10282 10.723134
93 11367 14.550919
103 11379 14.654854
99 11374 16.030373
106 11412 16.406496
116 11426 20.204995
102 11378 20.313659
101 11377 21.656617
48 11105 21.710894
46 11103 21.799666
100 11375 21.943096
96 11370 22.012463
115 11423 22.796439
90 11358 23.052579
88 11355 23.052579
123 11436 23.314540
94 11368 23.490963
114 11422 23.996252
92 11365 24.911426
107 11413 25.329377

What's next

  • If we had the specific numbers of daily price takers, weekly price takers and monthly price takers, instead of relying on the customer ratio assumption, we should be able to get a better revenue calculation.
  • Some instances are missing the zip code, but we have longitude and latitude. With the help of the Google API, we should be able to get the zip code directly.
  • The Zillow median home value covers only a very limited set of zip codes. If we can get median home values for more zip codes, we can calculate the average value per square foot at the zip code level instead of the neighborhood level.
  • We could create a UI web page to let the customer decide the business targets, such as which city is the best location to invest in, how many bedrooms are most profitable, what the real customer ratio is, or which evaluation metrics the client wants to see. With the profit_pipeline module, we can re-fit and transform the data with set_params(), reproduce all the logic, and get the results in seconds.
  • The most recent median home value is about 2 years old. We can try to use a time series model to find the trend and seasonality, use GridSearchCV to find the best period for the time series, and then predict the current median home value instead of just using the past 5 years' average growth rate.
  • The sample of the airbnb data is quite unbalanced with regard to zip code. We can try oversampling the minority zip codes or undersampling the majority zip codes to get a better result.
  • Bronx and Staten Island have very limited records, and under my assumptions none of them qualify for the later analysis steps. This creates a risk for the analysis result, because none of them can be identified as profitable, which might not be the truth. If we can get more records or test weaker outlier assumptions, we might get a more convincing result. If not, we can try bootstrap resampling to generate more records while preserving the original distribution.